AI Content Detection
How DrillBit detects AI-generated content
A detailed look at the science, methodology, and validated performance behind
DrillBit's AI detection engine — built for academic institutions that demand accuracy,
transparency, and fairness.
93% Human Detection Accuracy
83% AI Detection Accuracy
88% Overall System Accuracy
Detection Pipeline
How it works
Every submission passes through a five-stage analytical pipeline that produces an AI
probability score and then routes that score to human reviewers.
1 Extraction
Text is extracted, tokenised, and language-verified. Formatting
artefacts and metadata noise are stripped before analysis begins.
2 Feature Analysis
Perplexity, burstiness, stylometric profile, n-gram density, and
semantic coherence are measured across the full document.
3 Classification
An ensemble of neural networks and gradient-boosted models processes
all extracted features simultaneously.
4 Scoring
A continuous 0–100% AI probability score is generated. No forced
binary snap-judgements — evidence determines the score.
5 Review
Scores are presented to institutional reviewers. Uncertain cases in
the 20–60% zone are flagged for human adjudication.
What DrillBit looks for
Characteristics of AI-generated text
Large language models leave consistent statistical fingerprints in the text
they produce. DrillBit's engine is trained to detect all of these signals simultaneously.
Low perplexity
AI text is statistically predictable — each word follows high-probability patterns
learned from training data. Human writing contains surprising, lower-probability word
choices that deviate naturally from model expectations.
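As a minimal sketch of the idea (not DrillBit's implementation), perplexity can be computed from the probability a language model assigns to each token. The function name `perplexity` and the probability lists below are illustrative assumptions; a real system would obtain the probabilities from a language model.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token.

    `token_probs` is a hypothetical list of per-token probabilities that a
    language model would assign; the values used below are made up.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must be in (0, 1]")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Illustrative inputs only: high-probability tokens vs. surprising ones.
predictable = [0.9, 0.8, 0.9]
surprising = [0.1, 0.05, 0.2]
low_ppl = perplexity(predictable)
high_ppl = perplexity(surprising)
```

Text made of high-probability tokens yields a lower perplexity than text made of surprising tokens, which is the contrast the paragraph above describes.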
Low burstiness
Humans write in bursts — short punchy sentences punctuating longer analytical ones.
AI-generated text displays unnaturally uniform sentence lengths, producing a flat
rhythmic signature detectable by length-variance analysis.
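One simple way to illustrate length-variance analysis is the coefficient of variation of sentence lengths. This is a sketch under stated assumptions: the naive punctuation-based sentence splitter and the choice of statistic are illustrative, not the measure DrillBit necessarily uses.

```python
import re
import statistics

def sentence_lengths(text):
    """Split on ., ! and ? (a naive assumption) and count words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(text):
    """Population standard deviation of sentence length divided by the mean.

    Returns 0.0 when fewer than two sentences are available.
    """
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)

uniform = "One two three. Four five six. Seven eight nine."
varied = "Stop. This sentence is considerably longer than the first one. Why?"
```

Here `burstiness(uniform)` is zero because every sentence has the same length, while `burstiness(varied)` is positive because short and long sentences alternate.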
Stylometric uniformity
AI output lacks the idiosyncratic punctuation habits, vocabulary preferences, and
syntactic quirks that characterise individual human authors. Stylometric profiling
detects this absence of personal authorial voice.
Semantic overcoherence
Paragraphs produced by LLMs exhibit unnaturally smooth topic transitions and an absence
of the digressive, self-correcting flow typical of genuine human reasoning and academic
argumentation.
N-gram pattern density
AI models reuse common phrase-level constructions across documents. High-frequency n-gram
matching against a trained reference corpus reveals these repeated structural and
lexical patterns.
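The matching step can be sketched as the fraction of a document's n-grams that also appear in a reference set. Everything here is an assumption for illustration: the helper names, whitespace tokenisation, trigram size, and the tiny `reference` set stand in for a trained reference corpus.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_density(text, reference_ngrams, n=3):
    """Fraction of the text's n-grams that occur in `reference_ngrams`."""
    tokens = text.lower().split()
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in reference_ngrams)
    return hits / len(grams)

# Hypothetical reference set; a real one would be built from a large corpus.
reference = {("it", "is", "important"), ("is", "important", "to")}
density = ngram_density("It is important to note", reference)
```

In this toy example two of the three trigrams match the reference set, so `density` is two thirds.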
Lexical richness flatness
Type-token ratio and lexical diversity measures tend to fall within a narrower band in AI
text than in human writing, which varies considerably based on vocabulary breadth,
register shifts, and individual expression.
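Type-token ratio is the number of distinct words divided by the total number of words. The sketch below assumes a simple regex tokeniser and a hypothetical function name; because plain TTR drops as documents get longer, comparisons are usually made on texts of similar length or over fixed-size windows.

```python
import re

def type_token_ratio(text):
    """Distinct word forms divided by total word count (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

repetitive = type_token_ratio("a a a a")
mixed = type_token_ratio("the cat sat on the mat")
```

The repetitive sample has one distinct word out of four, and the second sample has five distinct words out of six.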
Detection Coverage
AI platforms DrillBit detects
Validated across the three dominant AI writing platforms used in academic
contexts, with ongoing updates as new models are released.
| Platform | Models covered | Content characteristics | Detection status |
|---|---|---|---|
| ChatGPT (OpenAI) | GPT-3.5, GPT-4, GPT-4o | Fluent, structured academic prose; consistent formal register across disciplines; strong paragraph organisation. | Fully supported |
| Gemini (Google DeepMind) | Gemini 1.0, 1.5 Pro | Information-dense output; varied register; strong technical vocabulary; tendency toward structured enumeration. | Fully supported |
| Grok (xAI) | Grok-1, Grok-1.5 | Conversational-to-formal range; distinct syntactic patterns; variable formality across prompt types. | Fully supported |
| Paraphrased AI (any platform) | Any model with manual or automated paraphrasing applied post-generation. | AI-generated text with surface-level edits intended to mask origin signals. Burstiness and stylometric markers often remain detectable. | Partial (improving) |
| Mixed authorship (any platform) | AI-assisted drafting interspersed with human-written passages. | Hybrid documents where AI and human sections alternate. Represents an emerging authorship pattern in academic submissions. | Partial (improving) |
Score Interpretation Guide
What does an AI score mean?
DrillBit assigns every document a continuous AI probability score from 0 to
100%. Scores in the 20–60% range are treated as uncertain and flagged for human review; lower
scores sit closer to the clearly human end and higher scores closer to the clearly AI end.
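The score bands can be sketched as a small lookup. The 20% and 60% boundaries come from the uncertain zone described on this page; the zone labels, the function name, and the choice to treat both boundaries as part of the uncertain zone are assumptions made for illustration.

```python
UNCERTAIN_LOW = 20.0   # lower edge of the uncertain zone (from this page)
UNCERTAIN_HIGH = 60.0  # upper edge of the uncertain zone (from this page)

def score_zone(score):
    """Map a 0-100 AI probability score to an illustrative review zone.

    Boundary handling (20 and 60 counted as uncertain) is an assumption.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if score < UNCERTAIN_LOW:
        return "likely human"
    if score <= UNCERTAIN_HIGH:
        return "uncertain: flag for human review"
    return "likely AI"

zones = [score_zone(s) for s in (5, 20, 45, 60, 85)]
```

Whatever the zone, the score remains an indicator for a human reviewer rather than a verdict.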
Validated Performance
Accuracy you can cite
DrillBit's detection accuracy was validated in a large-scale study across 2.5
million document samples — one of the largest empirical AI detection evaluations published to date.
93% Human Detection Accuracy: True Negative Rate across 1,000,000 human-authored samples
83% AI Detection Accuracy: True Positive Rate across 1,000,000 AI-generated samples
88% Overall System Accuracy: (TP + TN) / Total = 176 / 200
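The overall figure follows from the two per-class rates because both classes have the same number of samples. The sketch below recomputes it; the raw counts are back-calculated from the published percentages (so they are rounded assumptions, not figures taken from the white paper's confusion matrix), and the function name is hypothetical.

```python
def classification_rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate: AI text flagged as AI
    specificity = tn / (tn + fp)   # true negative rate: human text kept as human
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Counts implied by 83% of 1,000,000 AI samples and 93% of 1,000,000 human samples.
tp, fn = 830_000, 170_000
tn, fp = 930_000, 70_000
sensitivity, specificity, accuracy = classification_rates(tp, fn, tn, fp)
```

With equal class sizes the accuracy is the mean of the two rates, (83 + 93) / 200 = 176 / 200 = 88%.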
Common questions
Frequently asked
Can a student be penalised based solely on the AI score?
No. DrillBit's AI scores are designed as indicators for human review,
not automated enforcement tools. Any institutional action must involve qualified human
assessment of the submission in its full context. The score is one data point — not a
verdict. DrillBit strongly recommends that institutions establish clear AI use policies that
specify how scores are reviewed and what evidentiary standard is applied before any
disciplinary process is initiated.
Why is there an uncertain zone between 20% and 60%?
This range represents genuine classification ambiguity — content that
exhibits a mixture of human and AI linguistic characteristics. Rather than forcing a binary
result where the evidence is weak, DrillBit surfaces the score transparently and flags these
cases for human review. This design minimises false accusations while maintaining strong
detection at the clearly AI or clearly human extremes. Documents in this range may reflect
post-edited AI content, mixed authorship, or structured human writing styles that partially
overlap with AI output patterns.
What if a student writes in a very formal or structured academic style?
Formal human writing can produce elevated AI scores, particularly in
scientific and technical disciplines where conventions require precise, structured prose.
This is a known challenge across all AI detection systems. DrillBit's 20% classification
boundary is calibrated to tolerate structured human writing, and the validated 93% human
detection accuracy is consistent with this. Reviewers are advised to consider writing style, prior
submission history, and other contextual evidence alongside the AI score before drawing any
conclusions.
Does DrillBit detect AI in languages other than English?
The current validated evaluation covers English-language documents.
Multilingual AI detection is on our active development roadmap, with Arabic, Spanish,
French, Mandarin, and Hindi prioritised for the next model release cycle. Institutions with
non-English submission requirements are encouraged to contact DrillBit directly to discuss
rollout timelines.
How does DrillBit stay current as new AI models are released?
DrillBit maintains a continuous retraining pipeline. When significant
new AI models are released publicly, samples generated by those models are collected,
labelled, and incorporated into the next training cycle. Detection performance against new
models is evaluated against held-out test sets before any update is deployed to production.
Institutions are notified of significant model updates through the platform release notes.
Where can I read the full accuracy evaluation methodology?
DrillBit publishes a full white paper disclosing dataset composition
(2.5 million samples across 8 disciplines), classification boundary conditions, the complete
confusion matrix, and all performance metrics including sensitivity, specificity, and
overall accuracy with formulas. The white paper is available for free download from the
DrillBit resources page — no account required.
What is the minimum document length for reliable detection?
DrillBit's detection engine is optimised for documents of 500 words or
more — the threshold used in our validation study. Very short documents (under 200 words)
may produce lower-confidence scores because the statistical signals used for classification
require sufficient text to be measurable. For short submissions, scores should be
interpreted with additional caution and reviewer discretion.
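A reviewer-side length check could look like the sketch below. The 500-word and 200-word thresholds come from the answer above; the tier labels, the function name, whitespace word counting, and the handling of the 200 to 499 word range (which the answer does not characterise) are assumptions.

```python
VALIDATED_MIN_WORDS = 500   # threshold used in the validation study
LOW_CONFIDENCE_BELOW = 200  # below this, scores may be lower-confidence

def length_tier(text):
    """Classify a submission by word count into an illustrative reliability tier."""
    words = len(text.split())
    if words >= VALIDATED_MIN_WORDS:
        return "validated range"
    if words < LOW_CONFIDENCE_BELOW:
        return "low confidence: interpret with extra caution"
    return "intermediate: not characterised in the source"

tiers = [length_tier("word " * n) for n in (50, 300, 600)]
```

A tier like this would sit alongside the AI score so that short submissions are visibly marked for additional reviewer discretion.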